Storm data analysis

This report analyzes storm data to answer two questions:

  1. Across the United States, which types of events are most harmful with respect to population health?

  2. Across the United States, which types of events have the greatest economic consequences?

Required packages

To complete this analysis, we used three packages: dplyr, readr, and ggplot2.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(readr)
library(ggplot2)

This report has four parts: Synopsis, Data Processing, Results, and Summary.

Synopsis

This report analyzes the damage caused by severe weather events in the United States. We used the storm database of the U.S. National Oceanic and Atmospheric Administration (NOAA). Our goals are (1) to find which types of events are most harmful with respect to population health, and (2) to find which types of events have the greatest economic consequences.
Our analysis finds that tornadoes are the most harmful events to population health: they caused 5633 fatalities and 91346 injuries. Floods, on the other hand, have the greatest economic consequences.

Data Processing

First, we read the data from the original bz2 file and load it into a data set named “rawStorm”.

rawStorm <- tbl_df(read_csv("repdata_data_StormData.csv.bz2"))
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   BGN_DATE = col_character(),
##   BGN_TIME = col_character(),
##   TIME_ZONE = col_character(),
##   COUNTYNAME = col_character(),
##   STATE = col_character(),
##   EVTYPE = col_character(),
##   BGN_AZI = col_logical(),
##   BGN_LOCATI = col_logical(),
##   END_DATE = col_logical(),
##   END_TIME = col_logical(),
##   COUNTYENDN = col_logical(),
##   END_AZI = col_logical(),
##   END_LOCATI = col_logical(),
##   PROPDMGEXP = col_character(),
##   CROPDMGEXP = col_logical(),
##   WFO = col_logical(),
##   STATEOFFIC = col_logical(),
##   ZONENAMES = col_logical(),
##   REMARKS = col_logical()
## )
## See spec(...) for full column specifications.
## Warning: 5255570 parsing failures.
##  row col           expected actual                             file
## 1671 WFO 1/0/T/F/TRUE/FALSE     NG 'repdata_data_StormData.csv.bz2'
## 1673 WFO 1/0/T/F/TRUE/FALSE     NG 'repdata_data_StormData.csv.bz2'
## 1674 WFO 1/0/T/F/TRUE/FALSE     NG 'repdata_data_StormData.csv.bz2'
## 1675 WFO 1/0/T/F/TRUE/FALSE     NG 'repdata_data_StormData.csv.bz2'
## 1678 WFO 1/0/T/F/TRUE/FALSE     NG 'repdata_data_StormData.csv.bz2'
## .... ... .................. ...... ................................
## See problems(...) for more details.
str(rawStorm)
## Classes 'tbl_df', 'tbl' and 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : logi  NA NA NA NA NA NA ...
##  $ BGN_LOCATI: logi  NA NA NA NA NA NA ...
##  $ END_DATE  : logi  NA NA NA NA NA NA ...
##  $ END_TIME  : logi  NA NA NA NA NA NA ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : logi  NA NA NA NA NA NA ...
##  $ END_LOCATI: logi  NA NA NA NA NA NA ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : num  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: logi  NA NA NA NA NA NA ...
##  $ WFO       : logi  NA NA NA NA NA NA ...
##  $ STATEOFFIC: logi  NA NA NA NA NA NA ...
##  $ ZONENAMES : logi  NA NA NA NA NA NA ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : logi  NA NA NA NA NA NA ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
##  - attr(*, "problems")=Classes 'tbl_df', 'tbl' and 'data.frame': 5255570 obs. of  5 variables:
##   ..$ row     : int  1671 1673 1674 1675 1678 1679 1680 1681 1682 1683 ...
##   ..$ col     : chr  "WFO" "WFO" "WFO" "WFO" ...
##   ..$ expected: chr  "1/0/T/F/TRUE/FALSE" "1/0/T/F/TRUE/FALSE" "1/0/T/F/TRUE/FALSE" "1/0/T/F/TRUE/FALSE" ...
##   ..$ actual  : chr  "NG" "NG" "NG" "NG" ...
##   ..$ file    : chr  "'repdata_data_StormData.csv.bz2'" "'repdata_data_StormData.csv.bz2'" "'repdata_data_StormData.csv.bz2'" "'repdata_data_StormData.csv.bz2'" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   STATE__ = col_double(),
##   ..   BGN_DATE = col_character(),
##   ..   BGN_TIME = col_character(),
##   ..   TIME_ZONE = col_character(),
##   ..   COUNTY = col_double(),
##   ..   COUNTYNAME = col_character(),
##   ..   STATE = col_character(),
##   ..   EVTYPE = col_character(),
##   ..   BGN_RANGE = col_double(),
##   ..   BGN_AZI = col_logical(),
##   ..   BGN_LOCATI = col_logical(),
##   ..   END_DATE = col_logical(),
##   ..   END_TIME = col_logical(),
##   ..   COUNTY_END = col_double(),
##   ..   COUNTYENDN = col_logical(),
##   ..   END_RANGE = col_double(),
##   ..   END_AZI = col_logical(),
##   ..   END_LOCATI = col_logical(),
##   ..   LENGTH = col_double(),
##   ..   WIDTH = col_double(),
##   ..   F = col_double(),
##   ..   MAG = col_double(),
##   ..   FATALITIES = col_double(),
##   ..   INJURIES = col_double(),
##   ..   PROPDMG = col_double(),
##   ..   PROPDMGEXP = col_character(),
##   ..   CROPDMG = col_double(),
##   ..   CROPDMGEXP = col_logical(),
##   ..   WFO = col_logical(),
##   ..   STATEOFFIC = col_logical(),
##   ..   ZONENAMES = col_logical(),
##   ..   LATITUDE = col_double(),
##   ..   LONGITUDE = col_double(),
##   ..   LATITUDE_E = col_double(),
##   ..   LONGITUDE_ = col_double(),
##   ..   REMARKS = col_logical(),
##   ..   REFNUM = col_double()
##   .. )
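
The parsing failures reported above arise because readr guesses each column's type from the first rows of the file, where fields such as WFO and CROPDMGEXP are empty, and so treats them as logical. One possible way to avoid this is to declare the column types explicitly through the col_types argument of read_csv. The sketch below is not part of the original analysis: it runs on a small temporary file whose name (toy_path) and values are placeholders, and it assumes that reading every non-numeric column as character is acceptable.

```r
library(readr)

# Toy file standing in for the storm data; all values are placeholders.
toy_path <- tempfile(fileext = ".csv")
writeLines(c(
  "EVTYPE,FATALITIES,INJURIES,PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP,WFO",
  "TORNADO,0,15,25,K,0,,",
  "FLOOD,1,2,1.5,M,2,K,NG"
), toy_path)

# Read text columns as character by default and name the numeric ones,
# so that codes such as "K" or "NG" are never coerced to logical.
toy <- read_csv(
  toy_path,
  col_types = cols(
    .default   = col_character(),
    FATALITIES = col_double(),
    INJURIES   = col_double(),
    PROPDMG    = col_double(),
    CROPDMG    = col_double()
  )
)

# problems() lists any parsing failures; none are expected here.
nrow(problems(toy))
```

For the real file, the remaining numeric columns (for example LATITUDE and LONGITUDE) would need to be listed in the same way if they are used.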

The rawStorm data set contains 37 variables and 902297 observations, as the str() output shows. Because we want to know which event types are most harmful, we keep only the records that have a non-empty event type.

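# Keep only rows with a non-missing, non-blank event type.
rawStorm <- filter(rawStorm, EVTYPE != "", EVTYPE != " ", !is.na(EVTYPE))
# The grouped result is not assigned; the summaries below group again.
rawStorm %>% group_by(EVTYPE)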

Results

Because we have two different questions to answer, we split the data into two data sets that contain different variables.

1 Across the United States, which types of events are most harmful with respect to population health?

To find the events most harmful to population health, we analyze fatalities and injuries. We sum the total fatalities and total injuries for each event type and show the top 3 events.

harmful_with_health <- rawStorm %>%
  select(EVTYPE, FATALITIES, INJURIES) %>%
  group_by(EVTYPE) %>%
  # Total fatalities and injuries per event type
  summarize(total_fatalities = sum(FATALITIES),
            total_injuries = sum(INJURIES)) %>%
  # Rank by fatalities, then by injuries, and keep the top 3
  arrange(desc(total_fatalities), desc(total_injuries)) %>%
  head(3)
harmful_with_health

To compare total fatalities and total injuries, we use a bar plot, as follows:
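
A grouped bar plot of this kind could be drawn with ggplot2 roughly as sketched below. This is not the original plotting code: the data frame toy_health is a hypothetical stand-in with the same columns as harmful_with_health, and its event names and counts are placeholders rather than results.

```r
library(ggplot2)

# Hypothetical stand-in for harmful_with_health; the numbers are placeholders.
toy_health <- data.frame(
  EVTYPE = c("EVENT A", "EVENT B", "EVENT C"),
  total_fatalities = c(30, 20, 10),
  total_injuries = c(300, 50, 40)
)

# Stack the two measures into long form so the bars can sit side by side.
long_health <- rbind(
  data.frame(EVTYPE = toy_health$EVTYPE, measure = "fatalities",
             count = toy_health$total_fatalities),
  data.frame(EVTYPE = toy_health$EVTYPE, measure = "injuries",
             count = toy_health$total_injuries)
)

# One pair of bars per event type, colored by measure.
health_plot <- ggplot(long_health, aes(x = EVTYPE, y = count, fill = measure)) +
  geom_col(position = "dodge") +
  labs(x = "Event type", y = "Count", fill = NULL,
       title = "Fatalities and injuries by event type")
print(health_plot)
```

Replacing toy_health with harmful_with_health would plot the actual top 3 events.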

2 Across the United States, which types of events have the greatest economic consequences?

To find the events with the greatest economic consequences, we analyze property damage and crop damage. The variables “PROPDMGEXP” and “CROPDMGEXP” use letters to represent multipliers (H for hundreds, K for thousands, M for millions, B for billions), so we convert them to numbers to put all damage values in the same unit. We then rank the events by damage and show the top 5.

cost_of_economic <- rawStorm %>%
  select(EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP) %>%
  # Translate the exponent letters (H, K, M, B) into numeric multipliers;
  # any other non-missing code counts as 1, and a missing code yields NA.
  mutate(prop_exp = ifelse(PROPDMGEXP == "B" | PROPDMGEXP == "b", 1e9,
                    ifelse(PROPDMGEXP == "M" | PROPDMGEXP == "m", 1e6,
                    ifelse(PROPDMGEXP == "K" | PROPDMGEXP == "k", 1e3,
                    ifelse(PROPDMGEXP == "H" | PROPDMGEXP == "h", 1e2, 1)))),
         crop_exp = ifelse(CROPDMGEXP == "B" | CROPDMGEXP == "b", 1e9,
                    ifelse(CROPDMGEXP == "M" | CROPDMGEXP == "m", 1e6,
                    ifelse(CROPDMGEXP == "K" | CROPDMGEXP == "k", 1e3,
                    ifelse(CROPDMGEXP == "H" | CROPDMGEXP == "h", 1e2, 1)))),
         prop_dm = PROPDMG * prop_exp,
         crop_dm = CROPDMG * crop_exp) %>%
  group_by(EVTYPE) %>%
  # NA damage values are dropped from the totals
  summarize(total_prop_dm = sum(prop_dm, na.rm = TRUE),
            total_crop_dm = sum(crop_dm, na.rm = TRUE)) %>%
  arrange(desc(total_prop_dm), desc(total_crop_dm)) %>%
  head(5)

cost_of_economic
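
The letter-to-multiplier conversion described above can also be written as a small lookup helper. The function name exp_multiplier is hypothetical and not part of the original analysis. Unlike the nested ifelse calls, this sketch treats a missing exponent as a multiplier of 1 instead of NA; that is an assumption, not something the source states.

```r
# Hypothetical helper: map exponent letters to numeric multipliers.
exp_multiplier <- function(exp_code) {
  lookup <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)
  # Index by upper-cased letter; unknown or missing codes come back as NA.
  mult <- unname(lookup[toupper(exp_code)])
  # Assumption: anything that is not H, K, M or B counts as a multiplier of 1.
  mult[is.na(mult)] <- 1
  mult
}

# Example: 2.5 with exponent "K" and 3 with exponent "m".
c(2.5, 3) * exp_multiplier(c("K", "m"))
```

With such a helper, the two mutate expressions would reduce to PROPDMG * exp_multiplier(PROPDMGEXP) and CROPDMG * exp_multiplier(CROPDMGEXP).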

Summary

This report shows that tornadoes are the most harmful events to population health, having caused 5633 fatalities and 91346 injuries; they also cause a great deal of property damage. Floods are another serious type of disaster: they have caused the greatest economic damage, along with many casualties.